Gender Analisys based on Government's Transparency Portal data


Objective: The purpose of this notebook is to analyse the gender distribution of Federal University of Rio Grande do Norte's employees. The data was scrapped from the federal government transparency portal and the gender is not available in the transparency portal.

Participants:

  • Marco Olimpio - marco.olimpio at gmail
  • Rebecca Betwel - bekbetwel at gmail

Detailed explanation In this notebook we have to analyse the gender name of UFRN's employees. This list of employes could be requested from the Government Transparency Portal - http://www.portaldatransparencia.gov.br/. In this portal there is a list of all government bodys including the employees related to each one. To be precise about the part that concerns UFRN the url requested could be: LINK

In this scenario we have to request and process all 413 pages contaning the names of the UFRN's employees and run an analysis based on the name of then. This method is called web scrapping and we could utilize the Beatfulsoup library to make the scrapping and after run an analysis of gender.

Topics

Web Scrapping


According to Wikipedia (https://en.wikipedia.org/wiki/Web_scraping) web scrapping is a technique in witch the computer programm acquire data from human readble documents from internet. One of the most famous libraries used in python stack is the beautiful soup - https://www.crummy.com/software/BeautifulSoup/. You can easylly install it executing the command:

!pip install beautifulsoup

The scrapper


Making a visit to

we could see that we have a list of people's name and it constitutes the level 1 of information acquired. This list now is constituded of 413 pages. Each person's name has a link to a specification of what the person does, how long it is working in the organization and more and this constitutes the level two of data aquired. You could also notice that in the level two there is a link that points to a level three. The level three show how much the person receives, paycheck, how much taxes the person pays and other info.


The scrapping algorithm made to acquire all these data is
  • Interate throwgh all page data
    • Interate throwgh all person names
      • Retrieve link to level two page
      • Call retrieveinfo_level_two passing URL to level two (this acquire all info about level two page)
        • Call retrieveinfo_level_three passing URL to level three (this acquire all info about level three page)
All processes of acquiring this data took about 6,5 hours. In this process we could detect that people that are medical residents and military were not acquired (level two and level three information only level one)

Importing libraries needed


In [253]:
#Loading libraries needed

#System libraries
import os
import sys

import datetime
from datetime import date
##GeoJson data and services returned info
import json

import re
import requests

import urllib
from urllib.request import urlopen

from bs4 import BeautifulSoup as bs

#Basic libraries for data analysis
import numpy as np
from numpy import random
import pandas as pd


# Loaing visualization libraries
#Jupyter Magic word to inline matplotlib plots
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
#Inline bokeh charts
output_notebook()
from bokeh.io import push_notebook, show, output_notebook, output_file
from bokeh.layouts import row
from bokeh.plotting import figure
from bokeh.sampledata.commits import data
from bokeh.models import (
    GMapPlot, GMapOptions, ColumnDataSource, Circle, DataRange1d, PanTool, WheelZoomTool, BoxSelectTool, Jitter
)
from bokeh.core.properties import field

#Choropleth necessary libraries
##Necessary to create shapes in folium
from shapely.geometry import Polygon
from shapely.geometry import Point
##Choropleth itself
import folium
##Colormap
from branca.colormap import linear


Loading BokehJS ...

Function to retrieve information of the level two of information


In [30]:
def retrieveinfo_level_two(url):
    '''Retries data from the http://www.portaldatransparencia.gov.br/servidores/OrgaoExercicio-DetalhaServidor.asp '''
    employee_enrollment = 0
    employee_responsability = 0
    employee_responsability_class = 0
    employee_responsability_pattern = 0
    employee_responsability_reference = 0
    employee_responsability_level = 0
    employee_UF = 0
    employee_UORG = 0
    employee_legal_regime = 0
    employee_activit = 0
    employee_absence_from_work = 0
    employee_work_time = 0
    employee_last_job_responsability_modification_date = 0
    employee_job_responsability_nominee_date = 0
    employee_job_responsability_nominee_act = 0
    employee_last_job_responsability_modification_date_body = 0
    employee_entrance_public_service_legal_document = 0
    employee_entrance_public_service_legal_document_number  = 0
    employee_entrance_public_service_legal_document_date = 0
    responsability_description = 0
    responsability_activity = 0
    responsability_activity_parcial = 0
    responsability_UF = 0
    responsability_UORG = 0
    responsability_last_data_change_resp = 0
    responsability_sup_body = 0
    
    try:
        r = requests.post(url)
        s = bs(r.content,"html")

        #First part - retrive
        #print(len(s.find_all('table')))
        if len(s.find_all('table')) <=4:
            rows = s.find_all('table')[1].find_all('tr')

            employee_enrollment                                     = rows[2].findAll('td')[1].get_text(strip=True)
            employee_responsability                                 = rows[3].findAll('td')[1].get_text(strip=True)
            employee_responsability_class                           = rows[4].findAll('td')[1].get_text(strip=True)
            employee_responsability_pattern                         = rows[5].findAll('td')[1].get_text(strip=True)
            employee_responsability_reference                       = rows[6].findAll('td')[1].get_text(strip=True)
            employee_responsability_level                           = rows[7].findAll('td')[1].get_text(strip=True)

            employee_UF                                             = rows[9].findAll('td')[1].get_text(strip=True)
            employee_UORG                                           = rows[10].findAll('td')[1].get_text(strip=True)

            employee_legal_regime                                   = rows[17].findAll('td')[1].get_text(strip=True)
            employee_activit                                        = rows[18].findAll('td')[1].get_text(strip=True)
            employee_absence_from_work                              = rows[19].findAll('td')[1].get_text(strip=True)
            employee_work_time                                      = rows[20].findAll('td')[1].get_text(strip=True)

            employee_last_job_responsability_modification_date      = rows[21].findAll('td')[1].get_text(strip=True)
            employee_job_responsability_nominee_date                = rows[22].findAll('td')[1].get_text(strip=True)
            employee_job_responsability_nominee_act                 = rows[23].findAll('td')[1].get_text(strip=True)
            employee_last_job_responsability_modification_date_body = rows[24].findAll('td')[1].get_text(strip=True)

            employee_entrance_public_service_legal_document        = rows[27].findAll('td')[1].get_text(strip=True)
            employee_entrance_public_service_legal_document_number = rows[28].findAll('td')[1].get_text(strip=True)
            employee_entrance_public_service_legal_document_date   = rows[29].findAll('td')[1].get_text(strip=True)
        else:
            #Employee
            rows = s.find_all('table')[3].find_all('tr')

            employee_enrollment                                     = rows[1].findAll('td')[1].get_text(strip=True)
            employee_responsability                                 = rows[2].findAll('td')[1].get_text(strip=True)
            employee_responsability_class                           = rows[3].findAll('td')[1].get_text(strip=True)
            employee_responsability_pattern                         = rows[4].findAll('td')[1].get_text(strip=True)
            employee_responsability_reference                       = rows[5].findAll('td')[1].get_text(strip=True)
            employee_responsability_level                           = rows[6].findAll('td')[1].get_text(strip=True)

            employee_UF                                             = ''#rows[9].findAll('td')[1].get_text(strip=True)
            employee_UORG                                           = rows[8].findAll('td')[1].get_text(strip=True)

            employee_legal_regime                                   = rows[16].findAll('td')[1].get_text(strip=True)
            employee_activit                                        = rows[17].findAll('td')[1].get_text(strip=True)
            employee_absence_from_work                              = rows[18].findAll('td')[1].get_text(strip=True)
            employee_work_time                                      = rows[19].findAll('td')[1].get_text(strip=True)

            employee_last_job_responsability_modification_date      = rows[20].findAll('td')[1].get_text(strip=True)
            employee_job_responsability_nominee_date                = rows[21].findAll('td')[1].get_text(strip=True)
            employee_job_responsability_nominee_act                 = rows[22].findAll('td')[1].get_text(strip=True)
            employee_last_job_responsability_modification_date_body = rows[23].findAll('td')[1].get_text(strip=True)

            employee_entrance_public_service_legal_document        = rows[26].findAll('td')[1].get_text(strip=True)
            employee_entrance_public_service_legal_document_number = rows[27].findAll('td')[1].get_text(strip=True)
            employee_entrance_public_service_legal_document_date   = rows[28].findAll('td')[1].get_text(strip=True)

            #Responsabilitie
            rows = s.find_all('table')[2].find_all('tr')

            responsability_description                              = rows[3].findAll('td')[1].get_text(strip=True)
            responsability_activity                                 = rows[4].findAll('td')[1].get_text(strip=True)
            responsability_activity_parcial                         = rows[6].findAll('td')[1].get_text(strip=True)

            responsability_UF                                       = rows[7].findAll('td')[1].get_text(strip=True)
            employee_UF = responsability_UF
            responsability_UORG                                     = rows[9].findAll('td')[1].get_text(strip=True)
            responsability_sup_body                                 = rows[10].findAll('td')[1].get_text(strip=True)

            responsability_last_data_change_resp                    = rows[19].findAll('td')[1].get_text(strip=True)


        #Second part - retrive link to paycheck info and retrieve info about it
        link = s.findAll("a", { "title" : "Remuneração individual do servidor" })
        url_level_three = 'http://www.portaldatransparencia.gov.br' + link[0].get('href')
        #print(url_level_three)
        print("    L2 - OK")

        return [employee_enrollment, employee_responsability, employee_responsability_class, employee_responsability_pattern, employee_responsability_reference, employee_responsability_level, employee_UF, employee_UORG, employee_legal_regime, employee_activit, employee_absence_from_work, employee_work_time, employee_last_job_responsability_modification_date, employee_job_responsability_nominee_date, employee_job_responsability_nominee_act, employee_last_job_responsability_modification_date_body, employee_entrance_public_service_legal_document, employee_entrance_public_service_legal_document_number , employee_entrance_public_service_legal_document_date, responsability_description, responsability_activity, responsability_activity_parcial, responsability_UF, responsability_UORG, responsability_sup_body, responsability_last_data_change_resp, url_level_three]
    except:
        pass
        print("    L2 - NOINFO/FAIL URL: " + url)
    
    return [0, '', 0, 0, 0, 0, '', '', '', '', '', '', '', '', '', '', '', 0, '', '', '', '', '', '', '', '', '']

Function to retrieve level three of information


In [28]:
def retrieveinfo_level_three(url):
    '''Retries data from http://www.portaldatransparencia.gov.br/servidores/Servidor-DetalhaRemuneracao.asp'''
    emp_year = 0
    emp_tot_paycheck = 0
    emp_event_paycheck = 0
    emp_13_paycheck = 0
    emp_paid_vacation = 0
    emp_other = 0
    emp_irrf = 0
    emp_rgps = 0
    emp_paycheck_after_deduction = 0
    emp_other_receivings = 0
    emp_other_deduction = 0
    month_dict = {'janeiro':1, 'fevereiro':2, 'março':3, 'abril':4, 'maio':5, 'junho':6, 'julho':7, 'agosto':8, 'setembro':9, 'outubro':10, 'novembro':11, 'dezembro':12}

    try:
        r = requests.post(url)
        s = bs(r.content,"html")
        rows = s.find_all('tbody')[1].find_all('tr')
        counterPaycheck = 0
        emp_month = 9
        emp_year = 2017
        
        if len(s.find_all('tbody')[1].find_all('tr'))<10:
            print("        L3 OK - NO INFO")
            return [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    
        else:
            #Remuneração básica bruta
            if(len(rows[4].findAll('td')[2].get_text(strip=True))>0):
                emp_tot_paycheck = re.sub('[\.]+','',rows[4].findAll('td')[2].get_text(strip=True))
                emp_tot_paycheck = re.sub(',','.',emp_tot_paycheck)
            else:
                emp_tot_paycheck = 0.00

            if(len(rows[7].findAll('td')[2].get_text(strip=True))>0):
                    emp_irrf = re.sub('[\.]+','',rows[7].findAll('td')[2].get_text(strip=True))
                    emp_irrf = re.sub(',','.',emp_irrf)
            else:
                emp_irrf = 0.00

            if(len(rows[8].findAll('td')[2].get_text(strip=True))>0):
                emp_rgps = re.sub('[\.]+','',rows[8].findAll('td')[2].get_text(strip=True))
                emp_rgps = re.sub(',','.',emp_rgps)
            else:
                emp_rgps = 0.00

            if len(s.find_all('tbody')[1].find_all('tr'))==21:

                if(len(rows[11].findAll('td')[2].get_text(strip=True))>0):
                    emp_other_deduction = re.sub('[\.]+','',rows[11].findAll('td')[2].get_text(strip=True))
                    emp_other_deduction = re.sub(',','.',emp_other_deduction)
                else:
                    emp_rgps = 0.00

                if(len(rows[13].findAll('td')[1].get_text(strip=True))>0):
                    emp_paycheck_after_deduction = re.sub('[\.]+','',rows[13].findAll('td')[1].get_text(strip=True))
                    emp_paycheck_after_deduction = re.sub(',','.',emp_paycheck_after_deduction)
                else:
                    emp_paycheck_after_deduction = 0.00

                if(len(rows[18].findAll('td')[2].get_text(strip=True))>0):
                    emp_other_receivings = re.sub('[\.]+','',rows[18].findAll('td')[2].get_text(strip=True))
                    emp_other_receivings = re.sub(',','.',emp_other_receivings)    
                else:
                    emp_other_receivings = 0.00

            else:

                if(len(rows[10].findAll('td')[1].get_text(strip=True))>0):
                    emp_paycheck_after_deduction = re.sub('[\.]+','',rows[10].findAll('td')[1].get_text(strip=True))
                    emp_paycheck_after_deduction = re.sub(',','.',emp_paycheck_after_deduction)
                else:
                    emp_paycheck_after_deduction = 0.00

                if(len(rows[15].findAll('td')[2].get_text(strip=True))>0):
                    emp_other_receivings = re.sub('[\.]+','',rows[15].findAll('td')[2].get_text(strip=True))
                    emp_other_receivings = re.sub(',','.',emp_other_receivings)    
                else:
                    emp_other_receivings = 0.00
            
        print("        L3 OK")
        url=""
        return [emp_month, emp_year, emp_tot_paycheck, emp_irrf, emp_rgps, emp_other_deduction, emp_paycheck_after_deduction, emp_other_receivings]
    except:        
        url=""
        print("        Level THREE - FAIL URL: " + url)
    
    return [0, 0, 0, 0, 0, 0, 0, 0]

Main part of the web scrapping. The iteration about the employees of UFRN.


In [ ]:
#store data crawled
ufrnEmployeList = []
counter= 0
debug = False
#This method iterates through all the pages from the UFRN transparency page
try:
    for x in range(1,413):
        url = "http://www.portaldatransparencia.gov.br/servidores/OrgaoExercicio-ListaServidores.asp?CodOS=15000&DescOS=MINISTERIO%20DA%20EDUCACAO&CodOrg=26243&DescOrg=UNIVERSIDADE%20FED.%20DO%20RIO%20GRANDE%20DO%20NORTE&Pagina="+ str(x) +"+&TextoPesquisa="
        print("L1 - " + str(x) +'/413 - ' + str((x/413)*100) + '% pages loaded...0')
        r = requests.post(url)
        s = bs(r.content,"html")

        rows = s.find_all('table')[1].find_all('tr')

        counter= 0
        # Iterate through all lines 'tr'
        for row in rows:

            counter = counter + 1
            if counter == 1:
                continue

            #Take all the columns
            tdList = row.findAll('td')

            #Employee's CPF
            cpf = tdList[0].get_text(strip=True)

            #Employee's detail link
            hrefList = tdList[1].find_all('a')
            empHref = hrefList[0].get('href')

            #Employee's name
            name = tdList[1].get_text(strip=True)

            aux = [cpf, empHref, name]
            print("  L2")
            return_level_two = retrieveinfo_level_two("http://www.portaldatransparencia.gov.br/servidores/" + str(empHref))
            print("    L3")
            if(len(return_level_two[-1])>150):
                return_level_three = retrieveinfo_level_three(return_level_two[-1])
            else:
                print("    L2 - No link to L3")
                
            aux = aux + return_level_two + return_level_three
            ufrnEmployeList.append(aux)
            
            if debub:
                break
        if debub:
            break

except:
    print("L1 - Issues on page " + str(x) + " Line: " + str(counter) + " URL: " + str(url))
    pass
    
    
print("Number of employees aquired: " + str(len(ufrnEmployeList)))
for em in ufrnEmployeList:
    print(em)

In [41]:
# Transforming a list into da Pandas Data Frame
dfUfrnComplete = pd.DataFrame.from_records(ufrnEmployeList, columns=['cpf','hrefLevel2','name', 'emp_enrollment', 'emp_responsability', 'emp_responsability_class', 'emp_responsability_pattern', 'emp_responsability_reference', 'emp_responsability_level', 'emp_UF', 'emp_UORG', 'emp_legal_regime', 'emp_activit', 'emp_absence_from_work', 'emp_work_time', 'emp_last_job_responsability_modification_date', 'emp_job_responsability_nominee_date', 'emp_job_responsability_nominee_act', 'emp_last_job_responsability_modification_date_body', 'emp_entrance_public_service_legal_document', 'emp_entrance_public_service_legal_document_number', 'emp_entrance_public_service_legal_document_date', 'responsability_description', 'responsability_activity', 'responsability_activity_parcial', 'responsability_UF', 'responsability_UORG', 'responsability_sup_body', 'responsability_last_data_change_resp', 'url_level_three', 'emp_month', 'emp_year', 'emp_tot_paycheck', 'emp_irrf', 'emp_rgps', 'emp_other_deduction', 'emp_paycheck_after_deduction', 'emp_other_receivings', 'e1', 'e2', 'e3'])

Going deeper...


Well, the data acquired above are restricted to name and part of the CPF number. With this two data we made analysis about the gender and from where they have born. Know its time to make more and to make this we need more data related. In the first step we already have required a link with a detailed information about the UFRN's employee. So we have more two levels of crawling here, like this:

  • Level one List of employee name
    • Level two: Data about location of the employee and the career info.
      • Level three: Data about the paycheck

The first one, the one that we have already crawled. The second level could be accesed by the link from the link in the first level, this link is alread caugth in the DataFrame. We have to iterate this DataFrame and make the request from this link. The third level, the link to it is in the level two and we will request infor right after catch the infor of the level two one by one,

  • Emp. A
    • level 2 -> level 3
  • Emp. B
    • level 2 -> level 3
  • and so on

Lets go...

Warning!


After do a crawling aways remember to save your data... This notebook cracked once and I've lost everything :P


In [42]:
dfUfrnComplete.to_csv('ufrnEmployeeList_20112017.csv',sep=',')

In [45]:
dfUfrnComplete['first_name'] = dfUfrnComplete['name'].str.split(' ').str[0]
dfUfrnComplete['last_name'] = dfUfrnComplete['name'].str.split(' ').str[-1]

In [96]:
dfUfrn['first_name'].value_counts(sort=True)


Out[96]:
MARIA        292
JOSE         194
FRANCISCO    125
ANA          117
JOAO          95
LUIZ          64
CARLOS        63
ANTONIO       59
PAULO         49
PEDRO         40
MARCOS        39
LUCIANA       37
RICARDO       37
RODRIGO       37
DANIEL        35
ANDRE         35
BRUNO         33
RAFAEL        32
JULIANA       32
SERGIO        32
MARCELO       30
ALEXANDRE     28
EDUARDO       28
RENATA        28
MARCIO        26
FERNANDO      25
ADRIANA       25
LEONARDO      25
FERNANDA      24
MARCIA        24
            ... 
ASKERY         1
ACYNELLY       1
NAYANA         1
ADALGIZA       1
SUELI          1
LUANN          1
JAMILE         1
CLAUDIANA      1
DAMIANA        1
ILKA           1
DESIO          1
LADDYLA        1
RANIELLE       1
ARLINDA        1
WALDENICE      1
TAYSSA         1
VERNER         1
MIDORI         1
MOALDECIR      1
DELSON         1
ERIDJA         1
NEYRE          1
MILENE         1
IRANI          1
MATTHIEU       1
JOELZA         1
EMILY          1
ELENA          1
CORDELIA       1
ARRISON        1
Name: first_name, dtype: int64

In [46]:
dfUfrnComplete['first_name'].value_counts(sort=True)


Out[46]:
MARIA         292
JOSE          194
FRANCISCO     125
ANA           117
JOAO           95
LUIZ           64
CARLOS         63
ANTONIO        59
PAULO          49
PEDRO          40
MARCOS         39
LUCIANA        37
RICARDO        37
RODRIGO        37
ANDRE          35
DANIEL         35
BRUNO          33
RAFAEL         32
JULIANA        32
SERGIO         32
MARCELO        30
ALEXANDRE      28
RENATA         28
EDUARDO        28
MARCIO         26
LEONARDO       25
ADRIANA        25
FERNANDO       25
FERNANDA       24
MARCIA         24
             ... 
SUZIANE         1
LEILIANNE       1
ELIDETE         1
ELEDIR          1
JUDITHE         1
HELENICE        1
ARLENE          1
LAIZE           1
ANAYSI          1
NALVA           1
MILKIA          1
GIZELMA         1
UILIETE         1
MUIRAKYTAN      1
ROZANGELA       1
ANDREO          1
MACILON         1
ERONILSON       1
THELMA          1
ONELIA          1
ANILZA          1
EDILEUZA        1
VIRGILIO        1
NATERCIO        1
THAISSA         1
ANTHONY         1
LUCINEIA        1
VALTEMIA        1
ALDAYR          1
NIOMAR          1
Name: first_name, dtype: int64

In [47]:
dfUfrnComplete.head(10)


Out[47]:
cpf hrefLevel2 name emp_enrollment emp_responsability emp_responsability_class emp_responsability_pattern emp_responsability_reference emp_responsability_level emp_UF ... emp_irrf emp_rgps emp_other_deduction emp_paycheck_after_deduction emp_other_receivings e1 e2 e3 first_name last_name
0 ***.094.550-** OrgaoExercicio-DetalhaServidor.asp?IdServidor=... ABDO FARRET NETO 167**** PROFESSOR DO MAGISTERIO SUPERIOR 6 601 DEPARTAMENTO DE MEDICINA INTEGRADA ... 0 0 0 0 0 0.0 0.0 0.0 ABDO NETO
1 ***.942.324-** OrgaoExercicio-DetalhaServidor.asp?IdServidor=... ABMAEL BEZERRA DE OLIVEIRA 034**** PROFESSOR DO MAGISTERIO SUPERIOR 6 604 DEPARTAMENTO DE ENGENHARIA ELETRICA ... 0.00 0.00 0 0 0 NaN NaN NaN ABMAEL OLIVEIRA
2 ***.270.284-** OrgaoExercicio-DetalhaServidor.asp?IdServidor=... ABNER CARLOS COSTA DE MELO 215**** TECNICO EM ELETROTECNICA D 103 DEPTO DE FISICA TEORICA E EXPERIMENTAL ... -53.33 -290.57 0 2561.80 458.00 NaN NaN NaN ABNER MELO
3 ***.768.074-** OrgaoExercicio-DetalhaServidor.asp?IdServidor=... ABRAHAO SANDERSON NUNES FERNANDES DA SILVA 166**** PROFESSOR DO MAGISTERIO SUPERIOR 6 601 DEPARTAMENTO DE HISTORIA-CERES ... -1745.70 -1245.60 -64.20 8268.21 1423.02 NaN NaN NaN ABRAHAO SILVA
4 ***.013.113-** OrgaoExercicio-DetalhaServidor.asp?IdServidor=... ACACIA JESSICA MAIA DE MOURA 204**** ASSISTENTE SOCIAL E 403 ESCOLA AGRICOLA DE JUNDIAI - UAECA ... -772.98 -608.44 0 5199.19 808.70 NaN NaN NaN ACACIA MOURA
5 ***.793.704-** OrgaoExercicio-DetalhaServidor.asp?IdServidor=... ACACIO MEDEIROS NOGUEIRA 034**** TECNOLOGO-FORMACAO E 416 PRO-REITORIA DE ASSUNTOS ESTUDANTIS ... 0.00 0.00 0 0 0 NaN NaN NaN ACACIO NOGUEIRA
6 ***.245.464-** OrgaoExercicio-DetalhaServidor.asp?IdServidor=... ACILENE MARIA BORGES GOMES OLIVEIRA 034**** AUX EM ADMINISTRACAO C 416 GABINETE DO REITOR ... 0.00 0.00 0 0 0 NaN NaN NaN ACILENE OLIVEIRA
7 ***.827.884-** OrgaoExercicio-DetalhaServidor.asp?IdServidor=... ACSA NARA DE ARAUJO BRITO BARROS 240**** NUTRICIONISTA-HABILITACAO E 101 FACULDADE DE CIENCIAS DA SAUDE DO TRAIRI ... -409.54 -597.83 0 4427.48 1495.85 NaN NaN NaN ACSA BARROS
8 ***.264.124-** OrgaoExercicio-DetalhaServidor.asp?IdServidor=... ACYNELLY DAFNE DA SILVA NUNES 0 0 0 0 0 ... -409.54 -597.83 0 4427.48 1495.85 NaN NaN NaN ACYNELLY NUNES
9 ***.658.549-** OrgaoExercicio-DetalhaServidor.asp?IdServidor=... ADA CRISTINA SCUDELARI 122**** PROFESSOR DO MAGISTERIO SUPERIOR 8 801 DEPARTAMENTO DE ENGENHARIA CIVIL ... -3939.77 -2161.40 0 13548.01 579.14 NaN NaN NaN ADA SCUDELARI

10 rows × 43 columns

So after realizing the scrapping we have acquired the following informations:


  • cpf - CPF stands for 'Cadastro de Pessoas Único' that is a unique number for
  • cpf_region - The nascence region from the CPF
  • name - Name of the employee
  • emp_enrollment - Enrollment number
  • emp_responsability - Job responsability
  • emp_responsability_class - Job class number
  • emp_responsability_pattern - Job patter number
  • emp_responsability_reference - Job reference number
  • emp_responsability_level -
  • emp_UF - State from the location
  • emp_UORG -
  • emp_legal_regime -
  • emp_activit -
  • emp_absence_from_work -
  • emp_work_time -
  • emp_last_job_responsability_modification_date -
  • emp_job_responsability_nominee_date -
  • emp_job_responsability_nominee_act -
  • emp_last_job_responsability_modification_date_body -
  • emp_entrance_public_service_legal_document -
  • emp_entrance_public_service_legal_document_number -
  • emp_entrance_public_service_legal_document_date -
  • responsability_description -
  • responsability_activity -
  • responsability_activity_parcial -
  • responsability_UF -
  • responsability_UORG -
  • responsability_sup_body -
  • responsability_last_data_change_resp -
  • hrefLevel2 - URL for level two
  • url_level_three - URL for level three
  • emp_month - Month of the paycheck
  • emp_year - Year of the paycheck
  • emp_tot_paycheck - Total of the paycheck
  • emp_irrf - Income taxes
  • emp_rgps - Retirement taxes
  • emp_other_deduction - Other taxes
  • emp_paycheck_after_deduction - Income after deductions
  • emp_other_receivings - Other values that do not pay taxes
  • first_name - First name of employee
  • last_name - Last name of employee
  • gender_namsor - Return from the NamSor service
  • gender_namsor_adjusted - Adjustment for unkown names 'manually' made
  • gender_pygenderbr - Return from the pygenderbr
  • gender_genderguesser - Return from the genderguesser

In [190]:
dfUfrnComplete.columns


Out[190]:
Index(['cpf', 'hrefLevel2', 'name', 'emp_enrollment', 'emp_responsability',
       'emp_responsability_class', 'emp_responsability_pattern',
       'emp_responsability_reference', 'emp_responsability_level', 'emp_UF',
       'emp_UORG', 'emp_legal_regime', 'emp_activit', 'emp_absence_from_work',
       'emp_work_time', 'emp_last_job_responsability_modification_date',
       'emp_job_responsability_nominee_date',
       'emp_job_responsability_nominee_act',
       'emp_last_job_responsability_modification_date_body',
       'emp_entrance_public_service_legal_document',
       'emp_entrance_public_service_legal_document_number',
       'emp_entrance_public_service_legal_document_date',
       'responsability_description', 'responsability_activity',
       'responsability_activity_parcial', 'responsability_UF',
       'responsability_UORG', 'responsability_sup_body',
       'responsability_last_data_change_resp', 'url_level_three', 'emp_month',
       'emp_year', 'emp_tot_paycheck', 'emp_irrf', 'emp_rgps',
       'emp_other_deduction', 'emp_paycheck_after_deduction',
       'emp_other_receivings', 'e1', 'e2', 'e3', 'first_name', 'last_name',
       'gender_pygenderbr', 'gender_namsor', 'cpf_region',
       'gender_namsor_adjusted', 'gender_genderguesser'],
      dtype='object')

Namsor


The first tool we used to collect gender information is the Namsor - http://www.namsor.com/. That is a paid service but its is possible to test. It is really simple to use the service you just need to call the API like:
https://api.namsor.com/onomastics/api/json/gender/FIRST_NAME/LAST_NAME/COUNTRY

like

https://api.namsor.com/onomastics/api/json/gender/Marco/Oliveira/br
The free API has a limit of 1000 requests per month but in this experiment we realized about 12000 requests, made the request two times beacause I've lost the first time data.
So, to proceed and make the requests for every name we simply made a function to make a request and called the apply function of the Pandas DataFrame as we can see below.

In [62]:
def request_gender_namsor(row):
    #Example https://api.namsor.com/onomastics/api/json/gender/Marco/Oliveira/br
    try:
        url = 'https://api.namsor.com/onomastics/api/json/gender/'+row['first_name']+'/'+row['last_name']+'/br'
        print(url)
        response = urlopen(url)
        decoded = response.read().decode('utf-8')
        data = json.loads(decoded)
        return data['gender']
    except:
        return ''

In [ ]:
dfUfrnComplete['gender_namsor'] = dfUfrnComplete.apply(request_gender_namsor, axis=1)

In [65]:
dfUfrnComplete['gender_namsor'].value_counts(sort=True)


Out[65]:
male       3066
female     2854
unknown     246
             14
Name: gender_namsor, dtype: int64

As we can see above Namsor service contegorized 5920 names as male or female and the other 260 names were not categorized, or categorized as 'unknown'.


In [192]:
dfUfrnComplete['gender_namsor'].value_counts(sort=True).plot.bar()


Out[192]:
<matplotlib.axes._subplots.AxesSubplot at 0x11c8f0780>

In [78]:
dfUfrnComplete[(dfUfrnComplete['gender_namsor'] =='')]['first_name'].value_counts(sort=True)


Out[78]:
EUSAMAR      1
MARGARIDA    1
PAULO        1
JULIANA      1
SAVIO        1
SEBASTIAN    1
SANDRO       1
ANDERSON     1
UMBERTO      1
MURILLO      1
ADALBERTO    1
DANIEL       1
MARCUS       1
TATIANE      1
Name: first_name, dtype: int64

We will adjust the names that are unkown in the NAMSOR servise manually. For this, we will all first name finished in 'o', 'os', 'on', 'u', 'us' as male names. The other names, like finished in 'e', 'i', and etc do not follow a constant like male names do with o and u. However, there is a list of know names finished in 'o' that are female names and are:

  • Aiko
  • Amparo
  • Anunciação
  • Assunção
  • Calipso
  • Carmo
  • Carminho
  • Cléo
  • Charo
  • Clio
  • Conceição
  • Consolação
  • Consuelo
  • Dido
  • Erato
  • Índigo
  • Ino
  • Io
  • Izaro
  • Juno
  • Keiko
  • Lerato
  • Leto
  • Lilo
  • Lucero
  • Margô
  • Mirto
  • Purificação
  • Rocío
  • Rosário
  • Rosarinho
  • Socorro
  • Tamiko
  • Tariro
  • Temisto
  • Yoko
so for this we could seach for the first name:


In [111]:
dfUfrnComplete[dfUfrnComplete['first_name'].str.contains("AIKO | AMPARO | ANUNCIAÇÃO | ASSUNÇÃO | CALIPSO | CARMO  | CARMINHO | CLÉO | CHARO | CLIO | CONCEIÇÃO | CONSOLAÇÃO | CONSUELO | DIDO | ERATO | ÍNDIGO | INO | IO | IZARO | JUNO | KEIKO | LERATO | LETO | LILO | LUCERO | MARGÔ | MIRTO | PURIFICAÇÃO | ROCÍO | ROSÁRIO | ROSARINHO | SOCORRO | TAMIKO | TARIRO | TEMISTO | YOKO")==True]


Out[111]:
cpf hrefLevel2 name emp_enrollment emp_responsability emp_responsability_class emp_responsability_pattern emp_responsability_reference emp_responsability_level emp_UF ... emp_paycheck_after_deduction emp_other_receivings e1 e2 e3 first_name last_name gender_pygenderbr gender_namsor cpf_region

0 rows × 46 columns

as we can see there is no name of the given list se now we can generalize and say that the names that fall in the rule above described are male names


In [112]:
dfUfrnComplete['gender_namsor_adjusted'] = dfUfrnComplete['gender_namsor']

In [155]:
dfUfrnComplete.loc[(dfUfrnComplete['first_name'].str[-1] == 'O') & (dfUfrnComplete['gender_namsor'] =='unknown'),'gender_namsor_adjusted']#= 'male'
dfUfrnComplete.loc[(dfUfrnComplete['first_name'].str[-2:] == 'ON') & (dfUfrnComplete['gender_namsor'] =='unknown'),'gender_namsor_adjusted'] #= 'male'
dfUfrnComplete.loc[(dfUfrnComplete['first_name'].str[-1] == 'U') & (dfUfrnComplete['gender_namsor'] =='unknown'),'gender_namsor_adjusted'] #= 'male'
dfUfrnComplete.loc[(dfUfrnComplete['first_name'].str[-2:] == 'US') & (dfUfrnComplete['gender_namsor'] =='unknown'),'gender_namsor_adjusted'] #= 'male'
dfUfrnComplete.loc[(dfUfrnComplete['first_name'].str[-2:] == 'OS') & (dfUfrnComplete['gender_namsor'] =='unknown'),'gender_namsor_adjusted'] = 'male'

In [159]:
#dfUfrnComplete[(dfUfrnComplete['first_name'].str[-1] == 'O') & (dfUfrnComplete['gender_namsor'] =='unknown')]
#dfUfrnComplete[(dfUfrnComplete['first_name'].str[-2:] == 'ON') & (dfUfrnComplete['gender_namsor'] =='unknown')]
#dfUfrnComplete[(dfUfrnComplete['first_name'].str[-1] == 'U') & (dfUfrnComplete['gender_namsor'] =='unknown')]
#dfUfrnComplete[(dfUfrnComplete['first_name'].str[-2:] == 'US') & (dfUfrnComplete['gender_namsor'] =='unknown')]
#dfUfrnComplete[(dfUfrnComplete['first_name'].str[-2:] == 'OS') & (dfUfrnComplete['gender_namsor'] =='unknown')]
dfUfrnComplete[(dfUfrnComplete['gender_namsor_adjusted'] =='unknown')]['first_name']


Out[159]:
8          ACYNELLY
50          ADLAREG
127        AISLANIA
146         ALBIMAR
147         ALBIMAR
152         ALCIONE
265         ALMARIA
286          ALZENI
290        AMALUSIA
468       ANDERSONN
551          ANEIDE
554        ANELLYSA
610      ANTONIELLI
743        AURIGENA
747       AUZELIVIA
754          AZEMAR
766         BARNORA
767         BARTIRA
794          BONIEK
1027          CHENG
1087     CLAUDIANNY
1143       CRISLUCI
1144    CRISTHIANNE
1324     DERISCLEIA
1329       DEUSIMAR
1381        DJAILDE
1390      DOMINIQUE
1426         EDILZA
1509         EDZANA
1533         ELEDIR
           ...     
5523         SANZIA
5524        SAONARA
5568         SEMELY
5628       SHEYLENA
5673        SISLLEY
5680        SONAYDY
5689      SORANEIDE
5702        SUELENE
5709          SUENI
5710         SUENIA
5713         SULEMI
5724       SUZERICA
5740          TAIZA
5820         THAISE
5823         THAIZA
5832       THATYANE
5924        VALDECY
5927       VALDENIA
6061      WADLEIGHN
6069        WALANNE
6072      WALDENICE
6079        WALLACY
6090     WANDERLEIA
6094        WANUSIA
6100         WEDNEY
6118       WICLIFFE
6120         WILACI
6154       YULYANNA
6166     ZANKENNEDY
6178        ZORAIDE
Name: first_name, dtype: object

In [168]:
#dfUfrnComplete.loc[dfUfrnComplete['first_name'].str.contains("ACYNELLY | AISLANIA | ALCIONE | ALMARIA | AMALUSIA | ANEIDE | ANELLYSA | AURIGENA | AUZELIVIA | BARNORA | CLAUDIANNY | CRISLUCI | CRISTHIANNE | DERISCLEIA | EDILZA | EDZANA | SANZIA | SAONARA | SEMELY | SHEYLENA | SISLLEY | SONAYDY | SORANEIDE | SUELENE | SUENI | SUENIA | SULEMI | SUZERICA | TAIZA | THAISE | THAIZA | THATYANE | VALDECY | VALDENIA | WALANNE | WALDENICE | WANDERLEIA | WANUSIA | WICLIFFE | YULYANNA | ZORAIDE ")==True]['gender_namsor_adjusted'] = 'female'
dfUfrnComplete.loc[dfUfrnComplete['first_name'].str.contains("ACYNELLY|AISLANIA|ALCIONE|ALMARIA|AMALUSIA|ANEIDE|ANELLYSA|AURIGENA|AUZELIVIA|BARNORA|CLAUDIANNY|CRISLUCI|CRISTHIANNE|DERISCLEIA|EDILZA|EDZANA|SANZIA|SAONARA|SEMELY|SHEYLENA|SISLLEY|SONAYDY|SORANEIDE|SUELENE|SUENI|SUENIA|SULEMI|SUZERICA|TAIZA|THAISE|THAIZA|THATYANE|VALDECY|VALDENIA|WALANNE|WALDENICE|WANDERLEIA|WANUSIA|WICLIFFE|YULYANNA|ZORAIDE")]['gender_namsor_adjusted'] = 'female'


/Users/marco/anaconda/lib/python3.6/site-packages/ipykernel/__main__.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app

In [161]:
dfUfrnComplete[(dfUfrnComplete['gender_namsor_adjusted'] =='unknown')]['first_name']


Out[161]:
8          ACYNELLY
50          ADLAREG
127        AISLANIA
146         ALBIMAR
147         ALBIMAR
152         ALCIONE
265         ALMARIA
286          ALZENI
290        AMALUSIA
468       ANDERSONN
551          ANEIDE
554        ANELLYSA
610      ANTONIELLI
743        AURIGENA
747       AUZELIVIA
754          AZEMAR
766         BARNORA
767         BARTIRA
794          BONIEK
1027          CHENG
1087     CLAUDIANNY
1143       CRISLUCI
1144    CRISTHIANNE
1324     DERISCLEIA
1329       DEUSIMAR
1381        DJAILDE
1390      DOMINIQUE
1426         EDILZA
1509         EDZANA
1533         ELEDIR
           ...     
5523         SANZIA
5524        SAONARA
5568         SEMELY
5628       SHEYLENA
5673        SISLLEY
5680        SONAYDY
5689      SORANEIDE
5702        SUELENE
5709          SUENI
5710         SUENIA
5713         SULEMI
5724       SUZERICA
5740          TAIZA
5820         THAISE
5823         THAIZA
5832       THATYANE
5924        VALDECY
5927       VALDENIA
6061      WADLEIGHN
6069        WALANNE
6072      WALDENICE
6079        WALLACY
6090     WANDERLEIA
6094        WANUSIA
6100         WEDNEY
6118       WICLIFFE
6120         WILACI
6154       YULYANNA
6166     ZANKENNEDY
6178        ZORAIDE
Name: first_name, dtype: object

In [78]:
dfUfrnComplete[(dfUfrnComplete['gender_namsor'] =='')]['first_name'].value_counts(sort=True)


Out[78]:
EUSAMAR      1
MARGARIDA    1
PAULO        1
JULIANA      1
SAVIO        1
SEBASTIAN    1
SANDRO       1
ANDERSON     1
UMBERTO      1
MURILLO      1
ADALBERTO    1
DANIEL       1
MARCUS       1
TATIANE      1
Name: first_name, dtype: int64

We will adjust the names that are unkown in the NAMSOR servise manually. For this, we will all first name finished in 'o', 'os', 'on', 'u', 'us' as male names. The other names, like finished in 'e', 'i', and etc do not follow a constant like male names do with o and u. However, there is a list of know names finished in 'o' that are female names and are:

  • Aiko
  • Amparo
  • Anunciação
  • Assunção
  • Calipso
  • Carmo
  • Carminho
  • Cléo
  • Charo
  • Clio
  • Conceição
  • Consolação
  • Consuelo
  • Dido
  • Erato
  • Índigo
  • Ino
  • Io
  • Izaro
  • Juno
  • Keiko
  • Lerato
  • Leto
  • Lilo
  • Lucero
  • Margô
  • Mirto
  • Purificação
  • Rocío
  • Rosário
  • Rosarinho
  • Socorro
  • Tamiko
  • Tariro
  • Temisto
  • Yoko
so for this we could seach for the first name:


In [ ]:
dfUfrnComplete[dfUfrnComplete['first_name'].str.contains("AIKO|AMPARO|ANUNCIAÇÃO|ASSUNÇÃO|CALIPSO|CARMO|CARMINHO|CLÉO|CHARO|CLIO|CONCEIÇÃO|CONSOLAÇÃO|CONSUELO|DIDO|ERATO|ÍNDIGO|INO|IO|IZARO|JUNO|KEIKO|LERATO|LETO|LILO|LUCERO|MARGÔ|MIRTO|PURIFICAÇÃO|ROCÍO|ROSÁRIO|ROSARINHO|SOCORRO|TAMIKO|TARIRO|TEMISTO|YOKO")==True]

as we can see there is no name of the given list se now we can generalize and say that the names that fall in the rule above described are male names


In [112]:
dfUfrnComplete['gender_namsor_adjusted'] = dfUfrnComplete['gender_namsor']

In [155]:
#dfUfrnComplete.loc[(dfUfrnComplete['first_name'].str[-1] == 'O') & (dfUfrnComplete['gender_namsor'] =='unknown'),'gender_namsor_adjusted']#= 'male'
#dfUfrnComplete.loc[(dfUfrnComplete['first_name'].str[-2:] == 'ON') & (dfUfrnComplete['gender_namsor'] =='unknown'),'gender_namsor_adjusted'] #= 'male'
#dfUfrnComplete.loc[(dfUfrnComplete['first_name'].str[-1] == 'U') & (dfUfrnComplete['gender_namsor'] =='unknown'),'gender_namsor_adjusted'] #= 'male'
#dfUfrnComplete.loc[(dfUfrnComplete['first_name'].str[-2:] == 'US') & (dfUfrnComplete['gender_namsor'] =='unknown'),'gender_namsor_adjusted'] #= 'male'
#dfUfrnComplete.loc[(dfUfrnComplete['first_name'].str[-2:] == 'OS') & (dfUfrnComplete['gender_namsor'] =='unknown'),'gender_namsor_adjusted'] = 'male'

In [159]:
#dfUfrnComplete[(dfUfrnComplete['first_name'].str[-1] == 'O') & (dfUfrnComplete['gender_namsor'] =='unknown')]
#dfUfrnComplete[(dfUfrnComplete['first_name'].str[-2:] == 'ON') & (dfUfrnComplete['gender_namsor'] =='unknown')]
#dfUfrnComplete[(dfUfrnComplete['first_name'].str[-1] == 'U') & (dfUfrnComplete['gender_namsor'] =='unknown')]
#dfUfrnComplete[(dfUfrnComplete['first_name'].str[-2:] == 'US') & (dfUfrnComplete['gender_namsor'] =='unknown')]
#dfUfrnComplete[(dfUfrnComplete['first_name'].str[-2:] == 'OS') & (dfUfrnComplete['gender_namsor'] =='unknown')]
dfUfrnComplete[(dfUfrnComplete['gender_namsor_adjusted'] =='unknown')]['first_name']


Out[159]:
8          ACYNELLY
50          ADLAREG
127        AISLANIA
146         ALBIMAR
147         ALBIMAR
152         ALCIONE
265         ALMARIA
286          ALZENI
290        AMALUSIA
468       ANDERSONN
551          ANEIDE
554        ANELLYSA
610      ANTONIELLI
743        AURIGENA
747       AUZELIVIA
754          AZEMAR
766         BARNORA
767         BARTIRA
794          BONIEK
1027          CHENG
1087     CLAUDIANNY
1143       CRISLUCI
1144    CRISTHIANNE
1324     DERISCLEIA
1329       DEUSIMAR
1381        DJAILDE
1390      DOMINIQUE
1426         EDILZA
1509         EDZANA
1533         ELEDIR
           ...     
5523         SANZIA
5524        SAONARA
5568         SEMELY
5628       SHEYLENA
5673        SISLLEY
5680        SONAYDY
5689      SORANEIDE
5702        SUELENE
5709          SUENI
5710         SUENIA
5713         SULEMI
5724       SUZERICA
5740          TAIZA
5820         THAISE
5823         THAIZA
5832       THATYANE
5924        VALDECY
5927       VALDENIA
6061      WADLEIGHN
6069        WALANNE
6072      WALDENICE
6079        WALLACY
6090     WANDERLEIA
6094        WANUSIA
6100         WEDNEY
6118       WICLIFFE
6120         WILACI
6154       YULYANNA
6166     ZANKENNEDY
6178        ZORAIDE
Name: first_name, dtype: object

In [168]:
dfUfrnComplete.loc[dfUfrnComplete['first_name'].str.contains("ACYNELLY|AISLANIA|ALCIONE|ALMARIA|AMALUSIA|ANEIDE|ANELLYSA|AURIGENA|AUZELIVIA|BARNORA|CLAUDIANNY|CRISLUCI|CRISTHIANNE|DERISCLEIA|EDILZA|EDZANA|SANZIA|SAONARA|SEMELY|SHEYLENA|SISLLEY|SONAYDY|SORANEIDE|SUELENE|SUENI|SUENIA|SULEMI|SUZERICA|TAIZA|THAISE|THAIZA|THATYANE|VALDECY|VALDENIA|WALANNE|WALDENICE|WANDERLEIA|WANUSIA|WICLIFFE|YULYANNA|ZORAIDE")]['gender_namsor_adjusted'] = 'female'


/Users/marco/anaconda/lib/python3.6/site-packages/ipykernel/__main__.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app

Adjusting failed retried names


In [474]:
dfUfrnComplete.loc[(dfUfrnComplete['gender_namsor'] == ''),'gender_namsor_adjusted'] = 'unknown'


Out[474]:
male       3131
female     2854
unknown     195
Name: gender_namsor_adjusted, dtype: int64

In [161]:
dfUfrnComplete[(dfUfrnComplete['gender_namsor_adjusted'] =='unknown')]['first_name']


Out[161]:
8          ACYNELLY
50          ADLAREG
127        AISLANIA
146         ALBIMAR
147         ALBIMAR
152         ALCIONE
265         ALMARIA
286          ALZENI
290        AMALUSIA
468       ANDERSONN
551          ANEIDE
554        ANELLYSA
610      ANTONIELLI
743        AURIGENA
747       AUZELIVIA
754          AZEMAR
766         BARNORA
767         BARTIRA
794          BONIEK
1027          CHENG
1087     CLAUDIANNY
1143       CRISLUCI
1144    CRISTHIANNE
1324     DERISCLEIA
1329       DEUSIMAR
1381        DJAILDE
1390      DOMINIQUE
1426         EDILZA
1509         EDZANA
1533         ELEDIR
           ...     
5523         SANZIA
5524        SAONARA
5568         SEMELY
5628       SHEYLENA
5673        SISLLEY
5680        SONAYDY
5689      SORANEIDE
5702        SUELENE
5709          SUENI
5710         SUENIA
5713         SULEMI
5724       SUZERICA
5740          TAIZA
5820         THAISE
5823         THAIZA
5832       THATYANE
5924        VALDECY
5927       VALDENIA
6061      WADLEIGHN
6069        WALANNE
6072      WALDENICE
6079        WALLACY
6090     WANDERLEIA
6094        WANUSIA
6100         WEDNEY
6118       WICLIFFE
6120         WILACI
6154       YULYANNA
6166     ZANKENNEDY
6178        ZORAIDE
Name: first_name, dtype: object

(Py)Gender BR



In [48]:
from pygenderbr import Gender
gapi = Gender()

For purposes of data adjustment we changed the return from 'M' to 'male', 'F' to 'female' and '' to 'unknown'


In [338]:
def request_gender_pygenderbr(row):
    nome = row['first_name']
    gender_api = gapi.getgender(nome)
    gender_return = ''
    
    if gender_api[0] == 'M':
        gender_return = 'male'
    elif gender_api[0] == 'F':
        gender_return = 'female'
    elif gender_api[0].strip() == '':
        gender_return = 'unknown'
    else:
        gender_return = gender_api[0]
        
    print("Nome: " + nome + " Gender: " + gender_return + "   | ", end="")
    return gender_return

In [358]:
print("Stating to request Py GenderBR")
start_time = datetime.datetime.now()
print(start_time)
dfUfrnComplete['gender_pygenderbr'] = dfUfrnComplete.apply(request_gender_pygenderbr, axis=1)
print("Finished to request Py GenderBR")
finish_time = datetime.datetime.now()
print(finish_time)


Stating to request Py GenderBR
2017-11-30 16:26:35.480062
Finished to request Py GenderBR
2017-11-30 21:50:50.883395

In [368]:
print("Total time to complete gender requests: ")
print(finish_time-start_time)


Total time to complete gender requests: 
5:24:15.403333

In [359]:
dfUfrnComplete['gender_pygenderbr'].value_counts()


Out[359]:
male       2977
female     2791
unknown     412
Name: gender_pygenderbr, dtype: int64

In [372]:
dfUfrnComplete['gender_pygenderbr_adjusted'] = dfUfrnComplete['gender_pygenderbr']

Adjusting male names


In [394]:
dfUfrnComplete.loc[(dfUfrnComplete['first_name'].str[-1] == 'O')    & (dfUfrnComplete['gender_pygenderbr_adjusted'] =='unknown'),'gender_pygenderbr_adjusted'] = 'male'
dfUfrnComplete.loc[(dfUfrnComplete['first_name'].str[-2:] == 'ON')  & (dfUfrnComplete['gender_pygenderbr_adjusted'] =='unknown'),'gender_pygenderbr_adjusted'] = 'male'
dfUfrnComplete.loc[(dfUfrnComplete['first_name'].str[-1] == 'U')    & (dfUfrnComplete['gender_pygenderbr_adjusted'] =='unknown'),'gender_pygenderbr_adjusted'] = 'male'
dfUfrnComplete.loc[(dfUfrnComplete['first_name'].str[-2:] == 'US')  & (dfUfrnComplete['gender_pygenderbr_adjusted'] =='unknown'),'gender_pygenderbr_adjusted'] = 'male'
dfUfrnComplete.loc[(dfUfrnComplete['first_name'].str[-2:] == 'OS')  & (dfUfrnComplete['gender_pygenderbr_adjusted'] =='unknown'),'gender_pygenderbr_adjusted'] = 'male'

Adjusting female names


In [408]:
dfUfrnComplete.loc[dfUfrnComplete['first_name'].str.contains("ACYNELLY|AISLANIA|ALCIONE|ALMARIA|AMALUSIA|ANEIDE|ANELLYSA|AURIGENA|AUZELIVIA|BARNORA|CLAUDIANNY|CRISLUCI|CRISTHIANNE|DERISCLEIA|EDILZA|EDZANA|SANZIA|SAONARA|SEMELY|SHEYLENA|SISLLEY|SONAYDY|SORANEIDE|SUELENE|SUENI|SUENIA|SULEMI|SUZERICA|TAIZA|THAISE|THAIZA|THATYANE|VALDECY|VALDENIA|WALANNE|WALDENICE|WANDERLEIA|WANUSIA|WICLIFFE|YULYANNA|ZORAIDE"),'gender_pygenderbr_adjusted'] = 'female'

In [406]:
dfUfrnComplete[dfUfrnComplete['gender_pygenderbr']=='unknown']['first_name']


Out[406]:
8         ACYNELLY
28         ADEILZE
32         ADELENA
39        ADENYLZA
48            ADIR
50         ADLAREG
121          AIRAN
127       AISLANIA
129        AKALINE
142       ALBANYRA
146        ALBIMAR
147        ALBIMAR
152        ALCIONE
155         ALDACI
157         ALDAYR
164      ALEKSANDR
230        ALIANDA
264          ALMAN
265        ALMARIA
269        ALTAIVA
282      ALVESSENA
288       ALZIVANY
289         AMADJA
290       AMALUSIA
308        AMANDDA
309         AMANDY
319         AMIRIA
321         AMOYSA
446         ANAYSI
468      ANDERSONN
           ...    
5922       VALDECI
5924       VALDECY
5928    VALDENIDES
5949      VALTEMIA
5958      VANDINER
5989         VEDER
5990     VELASQUEZ
6011        VESCIO
6012         VIANA
6061     WADLEIGHN
6069       WALANNE
6080      WALNESIA
6081       WALTECA
6094       WANUSIA
6098       WATTSON
6118      WICLIFFE
6119        WIEBKE
6120        WILACI
6139    WILSYNNARA
6140      WINIFRED
6142        WOUBER
6146         YANAK
6154      YULYANNA
6163          ZAMA
6166    ZANKENNEDY
6170      ZENEWTON
6172           ZEU
6175        ZILMAR
6177    ZIVANILSON
6179        ZORANO
Name: first_name, dtype: object

In [409]:
dfUfrnComplete['gender_pygenderbr_adjusted'].value_counts()


Out[409]:
male       3057
female     2817
unknown     306
Name: gender_pygenderbr_adjusted, dtype: int64

Gender Guesser


Another library tested is the gender-guesser. This library is a fork of the sexmachine library, no longer maintained as we https://pypi.python.org/pypi/gender-guesser/#downloads

!pip install gender-guesser

In [169]:
import gender_guesser.detector as gender

In [184]:
def requestgender_genderguesser(row):
    try:
        d = gender.Detector()
        name = row['first_name']
        return d.get_gender(name.title())
    except:
        return 'unknown'

In [185]:
dfUfrnComplete['gender_genderguesser'] = dfUfrnComplete.apply(requestgender_genderguesser, axis=1)

In [187]:
dfUfrnComplete['gender_genderguesser'].value_counts(sort=True)


Out[187]:
male             2164
female           2069
unknown          1922
andy               10
mostly_female       8
mostly_male         7
Name: gender_genderguesser, dtype: int64

The Gender Guesser API utilizes more classification values as its return, so we made a adjustment to make all three tools/services similar in output. The Gender Guesser output andy, mostly_female and mostly_male will be categorized as unknown.


In [365]:
dfUfrnComplete['gender_genderguesser_adjusted'] = dfUfrnComplete['gender_genderguesser']
dfUfrnComplete.loc[(dfUfrnComplete['gender_genderguesser_adjusted'] == 'andy') | (dfUfrnComplete['gender_genderguesser_adjusted'] =='mostly_female') | (dfUfrnComplete['gender_genderguesser_adjusted'] =='mostly_male'),'gender_genderguesser_adjusted'] = 'unknown'

In [367]:
dfUfrnComplete['gender_genderguesser_adjusted'].value_counts(sort=True)


Out[367]:
male       2164
female     2069
unknown    1947
Name: gender_genderguesser_adjusted, dtype: int64

Adjusting male names, only three names according to this rule were found.


In [411]:
dfUfrnComplete.loc[(dfUfrnComplete['first_name'].str[-1] == 'O')    & (dfUfrnComplete['gender_genderguesser_adjusted'] =='unknown'),'gender_genderguesser_adjusted'] = 'male'
dfUfrnComplete.loc[(dfUfrnComplete['first_name'].str[-2:] == 'ON')  & (dfUfrnComplete['gender_genderguesser_adjusted'] =='unknown'),'gender_genderguesser_adjusted'] = 'male'
dfUfrnComplete.loc[(dfUfrnComplete['first_name'].str[-1] == 'U')    & (dfUfrnComplete['gender_genderguesser_adjusted'] =='unknown'),'gender_genderguesser_adjusted'] = 'male'
dfUfrnComplete.loc[(dfUfrnComplete['first_name'].str[-2:] == 'US')  & (dfUfrnComplete['gender_genderguesser_adjusted'] =='unknown'),'gender_genderguesser_adjusted'] = 'male'
dfUfrnComplete.loc[(dfUfrnComplete['first_name'].str[-2:] == 'OS')  & (dfUfrnComplete['gender_genderguesser_adjusted'] =='unknown'),'gender_genderguesser_adjusted'] = 'male'

Adjusting female names 48 known female names found but this time only one was considered female


In [420]:
dfUfrnComplete.loc[dfUfrnComplete['first_name'].str.contains("ACYNELLY|AISLANIA|ALCIONE|ALMARIA|AMALUSIA|ANEIDE|ANELLYSA|AURIGENA|AUZELIVIA|BARNORA|CLAUDIANNY|CRISLUCI|CRISTHIANNE|DERISCLEIA|EDILZA|EDZANA|SANZIA|SAONARA|SEMELY|SHEYLENA|SISLLEY|SONAYDY|SORANEIDE|SUELENE|SUENI|SUENIA|SULEMI|SUZERICA|TAIZA|THAISE|THAIZA|THATYANE|VALDECY|VALDENIA|WALANNE|WALDENICE|WANDERLEIA|WANUSIA|WICLIFFE|YULYANNA|ZORAIDE"),'gender_genderguesser_adjusted'] = 'female'

In [422]:
dfUfrnComplete['gender_genderguesser'].value_counts(sort=True)


Out[422]:
male             2164
female           2069
unknown          1922
andy               10
mostly_female       8
mostly_male         7
Name: gender_genderguesser, dtype: int64

In [423]:
dfUfrnComplete['gender_genderguesser_adjusted'].value_counts(sort=True)


Out[423]:
male       2758
female     2116
unknown    1306
Name: gender_genderguesser_adjusted, dtype: int64

Comparing tools


As we can see below, after common adjusts made to all tools (male names and female known names finished in the general rule of male names) we finished with the folowing results. Notice that the Gender Guesser is the tool with more unknown results followed by Gender BR. It's important to remember that the Namsor asks for the country origin of the name. The gender br are based on Brazil's CENSUS and the gender guesser is a group of data from some coutries to try to gueess the gender of the name.


In [525]:
result1 = dfUfrnComplete[['gender_namsor_adjusted','gender_pygenderbr_adjusted','gender_genderguesser_adjusted']]
result1 = result1.apply(lambda x: x.value_counts())
print(result1)
result1.plot.bar(figsize=(17,5))


         gender_namsor_adjusted  gender_pygenderbr_adjusted  \
male                       3131                        3057   
female                     2854                        2817   
unknown                     195                         306   

         gender_genderguesser_adjusted  
male                              2758  
female                            2116  
unknown                           1306  
Out[525]:
<matplotlib.axes._subplots.AxesSubplot at 0x12dfbcd30>

Exploratory Data Analisys - EDA

UFRN in numbers by gender


Local of nascence


In the year of 2000 was constituted the Lei de Responsabilidade Fiscal a budget and fiscal law for expenditures in the municipalities of Brazil. One of the main points in this law is about the transparency what is called acitve transparency that is the town hall, the legislative chamber and the other bodies of the government has to open the data in his following transparency portals. After that we had the Lei de Acesso à Informação another law with the aim to make improvments in the transparency portal as well as the access to information and know covering the passive transparency that is when the citizes asks for government information not available in the transparency portal. One one the points that the transparency portal has to obbey is to masks all the private data and it is what happens with http://www.portaldatransparencia.gov.br/ that masks the CPF number of the employees. But knowing the structure of the CPF we could analyse from the last number, excluding the check number (two last digits) we could map from where is the origin of the person. So we made a simple function to label from where the employees of UFRN came from according to https://pt.wikipedia.org/wiki/Cadastro_de_pessoas_f%C3%ADsicas and https://janio.sarmento.org/curiosidade-identificacao-de-cpf-conforme-o-estado/ we do have that the last digit has the correlation bellow:

  • 0 - Rio Grande do Sul</li>
  • 1 - Distrito Federal, Goiás, Mato Grosso, Mato Grosso do Sul e Tocantins</li>
  • 2 - Amazonas, Pará, Roraima, Amapá, Acre e Rondônia</li>
  • 3 - Ceará, Maranhão e Piauí</li>
  • 4 - Paraíba, Pernambuco, Alagoas e Rio Grande do Norte</li>
  • 5 - Bahia e Sergipe</li>
  • 6 - Minas Gerais</li>
  • 7 - Rio de Janeiro e Espírito Santo</li>
  • 8 - São Paulo</li>
  • 9 - Paraná e Santa Catarina</li>

In [68]:
def label_region (row):
    if row['cpf'][10:11] == '1':
        return 'Distrito Federal, Goiás, Mato Grosso, Mato Grosso do Sul e Tocantins'
    if row['cpf'][10:11] == '2':
        return 'Amazonas, Pará, Roraima, Amapá, Acre e Rondônia'
    if row['cpf'][10:11] == '3':
        return 'Ceará, Maranhão e Piauí'
    if row['cpf'][10:11] == '4':
        return 'Paraíba, Pernambuco, Alagoas e Rio Grande do Norte'
    if row['cpf'][10:11] == '5':
        return 'Bahia e Sergipe'
    if row['cpf'][10:11] == '6':
        return 'Minas Gerais'
    if row['cpf'][10:11] == '7':
        return 'Rio de Janeiro e Espírito Santo'
    if row['cpf'][10:11] == '8':
        return 'São Paulo'
    if row['cpf'][10:11] == '9':
        return 'Paraná e Santa Catarina'
    if row['cpf'][10:11] == '0':
        return 'Rio Grande do Sul'
    return ''

In [69]:
# Apply label by CPF locale information
dfUfrnComplete['cpf_region'] = dfUfrnComplete.apply(label_region, axis=1)

In [70]:
ufrnOrigin = dfUfrnComplete['cpf_region'].value_counts(sort=True).reset_index()
ufrnOrigin.columns=['name', 'count']
ufrnOrigin


Out[70]:
name count
0 Paraíba, Pernambuco, Alagoas e Rio Grande do N... 5112
1 São Paulo 238
2 Ceará, Maranhão e Piauí 205
3 Rio de Janeiro e Espírito Santo 173
4 Minas Gerais 99
5 Distrito Federal, Goiás, Mato Grosso, Mato Gro... 84
6 Rio Grande do Sul 81
7 Paraná e Santa Catarina 68
8 Bahia e Sergipe 64
9 Amazonas, Pará, Roraima, Amapá, Acre e Rondônia 56

In [72]:
# import geojson file about natal neighborhood
fiscal_region = os.path.join('geojson', 'mapCPF.geojson')

# load the data and use 'UTF-8'encoding
geo_json_fiscal = json.load(open(fiscal_region,encoding='UTF-8'))

In [73]:
fiscal = []
# list all fiscal regions
for neigh in geo_json_fiscal['features']:
        fiscal.append(neigh['properties']['name'])


Out[73]:
['Paraíba, Pernambuco, Alagoas e Rio Grande do Norte',
 'Rio Grande do Sul',
 'Paraná e Santa Catarina',
 'São Paulo',
 'Bahia e Sergipe',
 'Ceará, Maranhão e Piauí',
 'Rio de Janeiro e Espírito Santo',
 'Minas Gerais',
 'Distrito Federal, Goiás, Mato Grosso, Mato Grosso do Sul e Tocantins',
 'Amazonas, Pará, Roraima, Amapá, Acre e Rondônia']

In [74]:
colorscaleFiscalRegion = linear.OrRd.scale(ufrnOrigin['count'].min(), ufrnOrigin['count'].max())
threshold_scale = [ufrnOrigin['count'].min(), 80, 150, 200,  ufrnOrigin['count'].max()]

In [75]:
# Create a map object
m = folium.Map(
    location=[-14.150767, -51.057477],
    zoom_start=4,
    tiles='cartodbpositron'
)
#
m.choropleth(
    geo_data=geo_json_fiscal,
    data=ufrnOrigin,
    columns=['name', 'count'],
    key_on='feature.properties.name',
    fill_color='OrRd',
    legend_name='UFRN - Employee region of nascence',
    highlight=True,
    threshold_scale = threshold_scale,
    line_color='red',
    line_weight=0.2,
    line_opacity=0.6
)

Bellow we have a choropleth according to the last digit of CPF. This digit represent the fiscal regions extracted from the last tirth most right digit from CPF. As we can notice the region redish is the one with the most nascence region of UFRN's employees and its is the 4th fiscal region. Following the 4th comes the 8th (São Paulo) and 3th (Ceará, Maranhão and Piauí) fiscal regions.


In [76]:
m


Out[76]:

In [198]:
dfUfrnComplete['emp_responsability'].value_counts(sort=True)


Out[198]:
PROFESSOR DO MAGISTERIO SUPERIOR            2107
ASSISTENTE EM ADMINISTRACAO                  709
                                             554
PROFESSOR MAGISTERIO SUPERIOR-SUBSTITUTO     229
AUXILIAR DE ENFERMAGEM                       228
PROFESSOR ENS BASICO TECN TECNOLOGICO        208
TECNICO DE LABORATORIO AREA                  169
MEDICO-AREA                                  167
AUX EM ADMINISTRACAO                         141
VIGILANTE                                    131
ENFERMEIRO-AREA                               98
TECNICO EM ENFERMAGEM                         86
PROFESSOR MAGISTERIO SUPERIOR-TEMPORARIO      81
TEC DE TECNOLOGIA DA INFORMACAO               65
ENGENHEIRO-AREA                               63
TECNICO EM ASSUNTOS EDUCACIONAIS              63
SERVENTE DE LIMPEZA                           62
BIBLIOTECARIO-DOCUMENTALISTA                  58
ADMINISTRADOR                                 38
ASSISTENTE DE LABORATORIO                     37
ASSISTENTE SOCIAL                             34
NUTRICIONISTA-HABILITACAO                     33
PSICOLOGO-AREA                                29
SECRETARIO EXECUTIVO                          28
PROF ENS BAS TEC TECNOLOGICO-SUBSTITUTO       28
AUXILIAR DE LABORATORIO                       25
MOTORISTA                                     22
FARMACEUTICO BIOQUIMICO                       22
JORNALISTA                                    22
ANALISTA DE TEC DA INFORMACAO                 21
                                            ... 
TECNICO EM SANEAMENTO                          2
MARCENEIRO                                     2
ZOOTECNISTA                                    2
COZINHEIRO                                     2
MEDICO                                         2
ATENDENTE DE CONSULTORIO-AREA                  1
TECNICO EM QUIMICA                             1
AAD-AUXILIAR ADMINISTRATIVO                    1
MEDICO - 24H                                   1
PROFESSOR DE 1  E 2  GRAUS                     1
TEC EM INSTRUMENTACAO                          1
MUSICO                                         1
PROCURADOR FEDERAL                             1
TIPOGRAFO                                      1
PESQUISADOR EM SAUDE PUBLICA                   1
COSTUREIRO                                     1
AUXILIAR DE MECANICA                           1
RECREACIONISTA                                 1
COREOGRAFO                                     1
ASSIST TECNICO DE GESTAO EM SAUDE              1
APONTADOR                                      1
DIRETOR DE PROGRAMA                            1
TEC EM MANUTENCAO DE AUDIO VIDEO               1
DIRETOR DE IMAGEM                              1
TECNICO EM AUDIOVISUAL                         1
TEC EM ELETROELETRONICA                        1
OPERADOR DE CALDEIRA                           1
TEC EM TELECOMUNICACAO                         1
PEDREIRO                                       1
REVISOR DE TEXTOS BRAILLE                      1
Name: emp_responsability, dtype: int64

In [213]:
result = dfUfrnComplete.groupby(['emp_responsability','gender_namsor_adjusted'])['emp_responsability'].count().unstack('gender_namsor_adjusted').fillna(0)
result     
#dfUfrnComplete['emp_responsability'].value_counts(sort=True)
#filter = pessoas_tce.groupby(['setor','GENERO'])['setor'].count().unstack('GENERO').fillna(0)
#filter.sum(axis=1).sort_values().plot.barh(figsize=(17,30))


Out[213]:
gender_namsor_adjusted female male unknown
emp_responsability
1.0 343.0 197.0 13.0
AAD-AUXILIAR ADMINISTRATIVO 0.0 1.0 0.0 0.0
ADMINISTRADOR 0.0 14.0 24.0 0.0
ALMOXARIFE 0.0 0.0 2.0 0.0
ANALISTA DE TEC DA INFORMACAO 0.0 3.0 17.0 1.0
APONTADOR 0.0 0.0 1.0 0.0
ARMAZENISTA 0.0 0.0 2.0 0.0
ARQUITETO E URBANISTA 0.0 5.0 8.0 0.0
ARQUIVISTA 0.0 2.0 1.0 0.0
ASSIST TECNICO DE GESTAO EM SAUDE 0.0 1.0 0.0 0.0
ASSISTENTE DE ALUNO 0.0 10.0 6.0 0.0
ASSISTENTE DE DIRECAO E PRODUCAO 0.0 1.0 2.0 0.0
ASSISTENTE DE LABORATORIO 0.0 17.0 17.0 3.0
ASSISTENTE EM ADMINISTRACAO 1.0 285.0 397.0 26.0
ASSISTENTE SOCIAL 0.0 29.0 3.0 2.0
ATENDENTE DE CONSULTORIO-AREA 0.0 1.0 0.0 0.0
ATENDENTE DE ENFERMAGEM 0.0 1.0 1.0 0.0
AUDITOR 0.0 0.0 3.0 0.0
AUX DE VETERINARIA E ZOOTECNIA 0.0 1.0 1.0 0.0
AUX EM ADMINISTRACAO 0.0 63.0 76.0 2.0
AUXILIAR DE ANATOMIA E NECROPSIA 0.0 0.0 4.0 0.0
AUXILIAR DE BIBLIOTECA 0.0 0.0 2.0 0.0
AUXILIAR DE COZINHA 0.0 2.0 0.0 0.0
AUXILIAR DE CRECHE 0.0 7.0 5.0 1.0
AUXILIAR DE ELETRICISTA 0.0 0.0 2.0 0.0
AUXILIAR DE ENFERMAGEM 0.0 173.0 44.0 11.0
AUXILIAR DE FARMACIA 0.0 4.0 4.0 1.0
AUXILIAR DE LABORATORIO 0.0 10.0 12.0 3.0
AUXILIAR DE MECANICA 0.0 0.0 1.0 0.0
AUXILIAR DE NUTRICAO E DIETETICA 0.0 5.0 3.0 0.0
... ... ... ... ...
TEC EM SEGURANCA DO TRABALHO 0.0 0.0 4.0 0.0
TEC EM TELECOMUNICACAO 0.0 0.0 1.0 0.0
TECNICO DE LABORATORIO AREA 1.0 67.0 92.0 9.0
TECNICO EM AGROPECUARIA 0.0 1.0 4.0 0.0
TECNICO EM ALIMENTOS E LATICINIOS 0.0 3.0 0.0 0.0
TECNICO EM ARQUIVO 0.0 1.0 2.0 1.0
TECNICO EM ARTES GRAFICAS 0.0 0.0 5.0 0.0
TECNICO EM ASSUNTOS EDUCACIONAIS 1.0 35.0 26.0 1.0
TECNICO EM AUDIOVISUAL 0.0 0.0 1.0 0.0
TECNICO EM CONTABILIDADE 0.0 6.0 11.0 0.0
TECNICO EM EDIFICACOES 0.0 2.0 7.0 0.0
TECNICO EM ELETROMECANICA 0.0 0.0 2.0 0.0
TECNICO EM ELETRONICA 0.0 0.0 4.0 1.0
TECNICO EM ELETROTECNICA 0.0 1.0 16.0 0.0
TECNICO EM ENFERMAGEM 0.0 65.0 18.0 3.0
TECNICO EM GEOLOGIA 0.0 1.0 1.0 0.0
TECNICO EM HIGIENE DENTAL 0.0 2.0 1.0 0.0
TECNICO EM MECANICA 0.0 0.0 5.0 0.0
TECNICO EM QUIMICA 0.0 0.0 1.0 0.0
TECNICO EM RADIOLOGIA 0.0 6.0 14.0 1.0
TECNICO EM SANEAMENTO 0.0 0.0 1.0 1.0
TECNICO EM SECRETARIADO 0.0 5.0 2.0 1.0
TECNOLOGO-FORMACAO 0.0 3.0 13.0 1.0
TELEFONISTA 0.0 4.0 1.0 0.0
TIPOGRAFO 0.0 0.0 1.0 0.0
TRADUTOR INTERPRETE 0.0 2.0 1.0 0.0
TRADUTOR INTERPRETE DE LINGUAGEM SINAIS 0.0 6.0 6.0 0.0
VESTIARISTA 0.0 2.0 0.0 0.0
VIGILANTE 0.0 2.0 129.0 0.0
ZOOTECNISTA 0.0 0.0 2.0 0.0

155 rows × 4 columns


In [216]:
result.sum(axis=1).sort_values().plot.barh(stacked=True,figsize=(17,30))


Out[216]:
<matplotlib.axes._subplots.AxesSubplot at 0x11a650e10>

In [220]:
result = dfUfrnComplete[(dfUfrnComplete['emp_responsability']=='PROFESSOR DO MAGISTERIO SUPERIOR')].groupby(['emp_UORG','gender_namsor_adjusted'])['emp_UORG'].count().unstack('gender_namsor_adjusted').fillna(0)
result


Out[220]:
gender_namsor_adjusted female male unknown
emp_UORG
0.0 8.0 8.0 0.0
ADMINISTRACAO DO NUCLEO DE NOVA CRUZ 0.0 0.0 1.0 0.0
CENTRO DE CIENCIAS JUR. E SOC. APLICADAS 0.0 1.0 0.0 0.0
DEPARTAMENTO DE ANTROPOLOGIA 0.0 2.0 1.0 0.0
DEPARTAMENTO DE ARQUITETURA 0.0 1.0 3.0 0.0
DEPARTAMENTO DE ARTES 0.0 3.0 7.0 0.0
DEPARTAMENTO DE BIOQUIMICA 0.0 2.0 1.0 0.0
DEPARTAMENTO DE CIENCIA DA INFORMACAO 0.0 2.0 1.0 0.0
DEPARTAMENTO DE CIENCIAS ADMINISTRATIVAS 0.0 1.0 6.0 0.0
DEPARTAMENTO DE CIENCIAS CONTABEIS 0.0 1.0 5.0 1.0
DEPARTAMENTO DE CIENCIAS SOCIAIS 0.0 1.0 5.0 0.0
DEPARTAMENTO DE CIRURGIA 0.0 1.0 3.0 0.0
DEPARTAMENTO DE COMUNICACAO SOCIAL 0.0 3.0 5.0 0.0
DEPARTAMENTO DE DIREITO PRIVADO 0.0 2.0 2.0 0.0
DEPARTAMENTO DE DIREITO PUBLICO 0.0 0.0 3.0 0.0
DEPARTAMENTO DE DIREITO-CERES 0.0 0.0 2.0 0.0
DEPARTAMENTO DE ECOLOGIA 0.0 2.0 0.0 0.0
DEPARTAMENTO DE ECONOMIA 0.0 1.0 4.0 0.0
DEPARTAMENTO DE EDUCACAO FISICA 0.0 1.0 7.0 0.0
DEPARTAMENTO DE EDUCACAO-CERES 0.0 2.0 1.0 0.0
DEPARTAMENTO DE ENFERMAGEM 0.0 8.0 0.0 0.0
DEPARTAMENTO DE ENGENHARIA BIOMEDICA 0.0 0.0 1.0 1.0
DEPARTAMENTO DE ENGENHARIA CIVIL 0.0 6.0 6.0 0.0
DEPARTAMENTO DE ENGENHARIA DE MATERIAIS 0.0 1.0 4.0 0.0
DEPARTAMENTO DE ENGENHARIA DE PETROLEO 0.0 1.0 1.0 1.0
DEPARTAMENTO DE ENGENHARIA DE PRODUCAO 0.0 1.0 3.0 0.0
DEPARTAMENTO DE ENGENHARIA ELETRICA 0.0 0.0 4.0 0.0
DEPARTAMENTO DE ENGENHARIA MECANICA 0.0 0.0 4.0 0.0
DEPARTAMENTO DE ENGENHARIA QUIMICA 0.0 2.0 3.0 0.0
DEPARTAMENTO DE ENGENHARIA TEXTIL 0.0 1.0 2.0 0.0
... ... ... ... ...
DEPARTAMENTO DE TOCO-GINECOLOGIA 0.0 3.0 0.0 0.0
DEPARTAMENTO DE TURISMO 0.0 0.0 3.0 1.0
DEPT DE DIREITO PROCESSUAL PROPEDEUTICA 0.0 2.0 1.0 0.0
DEPTO DE ANALISES CLINIC E TOXICOLOGICAS 0.0 2.0 0.0 0.0
DEPTO DE BIOFISICA E FARMACOLOGIA 0.0 1.0 5.0 0.0
DEPTO DE BIOLOGIA CELULAR E GENETICA 0.0 3.0 1.0 0.0
DEPTO DE BOTANICA E ZOOLOGIA 0.0 1.0 3.0 0.0
DEPTO DE CIENCIAS ATMOSFERICAS E CLIMATI 0.0 1.0 1.0 0.0
DEPTO DE CIENCIAS EXATAS E APLICADAS 0.0 1.0 4.0 0.0
DEPTO DE CIENCIAS SOCIAIS E HUMANAS 0.0 2.0 0.0 1.0
DEPTO DE DEMOGRAFIA E CIENCIAS ATUARIAIS 0.0 1.0 2.0 0.0
DEPTO DE ENG DE COMPUTACAO E AUTOMACAO 0.0 0.0 6.0 0.0
DEPTO DE ENGENHARIA DE COMUNICACOES 0.0 0.0 4.0 0.0
DEPTO DE FISICA TEORICA E EXPERIMENTAL 0.0 2.0 6.0 0.0
DEPTO DE FUND E POLITICAS DA EDUCACAO 0.0 3.0 1.0 0.0
DEPTO DE INFORMATICA E MAT APLICADA 0.0 2.0 6.0 0.0
DEPTO DE LING E LITER ESTRANG MODERNAS 0.0 2.0 4.0 0.0
DEPTO DE MICROBIOLOGIA E PARASITOLOGIA 0.0 5.0 1.0 1.0
DEPTO DE OCEANOGRAFIA E LIMNOLOGIA 0.0 1.0 2.0 0.0
DEPTO DE PRAT EDUCACIONAIS E CURRICULO 0.0 6.0 3.0 0.0
ESCOLA AGRICOLA DE JUNDIAI - UAECA 0.0 2.0 7.0 0.0
ESCOLA DE CIENCIAS E TECNOLOGIA - UAECT 0.0 4.0 4.0 0.0
ESCOLA DE MUSICA - UAEM 0.0 1.0 1.0 0.0
ESCOLA MULTICAMPI DE CIENCIAS MEDICAS 0.0 3.0 3.0 0.0
FACULDADE DE CIENCIAS DA SAUDE DO TRAIRI 0.0 4.0 7.0 0.0
INSTITUTO DE QUIMICA 0.0 5.0 4.0 0.0
INSTITUTO DO CEREBRO 0.0 3.0 3.0 0.0
INSTITUTO METROPOLE DIGITAL 0.0 1.0 4.0 0.0
UNIVERSIDADE FED. DO RIO GRANDE DO NORTE 5.0 708.0 902.0 44.0
UNIVERSIDADE FEDERAL DE ALAGOAS 0.0 1.0 0.0 0.0

88 rows × 4 columns


In [221]:
result.sum(axis=1).sort_values().plot.barh(stacked=True,figsize=(17,30))


Out[221]:
<matplotlib.axes._subplots.AxesSubplot at 0x11a917e10>

In [235]:
result = dfUfrnComplete[(dfUfrnComplete['emp_responsability']=='PROFESSOR DO MAGISTERIO SUPERIOR')].groupby(['cpf_region','gender_namsor_adjusted'])['cpf_region'].count().unstack('gender_namsor_adjusted').fillna(0)
result


Out[235]:
gender_namsor_adjusted female male unknown
cpf_region
Amazonas, Pará, Roraima, Amapá, Acre e Rondônia 0.0 13.0 11.0 3.0
Bahia e Sergipe 0.0 12.0 23.0 1.0
Ceará, Maranhão e Piauí 0.0 41.0 58.0 0.0
Distrito Federal, Goiás, Mato Grosso, Mato Grosso do Sul e Tocantins 1.0 19.0 30.0 2.0
Minas Gerais 0.0 33.0 40.0 1.0
Paraná e Santa Catarina 0.0 20.0 30.0 4.0
Paraíba, Pernambuco, Alagoas e Rio Grande do Norte 3.0 581.0 765.0 41.0
Rio Grande do Sul 1.0 38.0 30.0 1.0
Rio de Janeiro e Espírito Santo 1.0 43.0 69.0 1.0
São Paulo 0.0 89.0 101.0 1.0

Distribution according to local of nascence, CPF fiscal region. We can see tha only 'Rio Grande do Sul' and 'Amazonas,Pará, Roraima, Amapá e Rondonia' fiscal regions that there are more females than male professors.


In [263]:
result[['male','female']].plot.bar(color=sns.color_palette(),figsize=(17,10))


Out[263]:
<matplotlib.axes._subplots.AxesSubplot at 0x12c79dda0>

In [ ]:
#result = dfUfrnComplete[(dfUfrnComplete['emp_responsability']=='PROFESSOR DO MAGISTERIO SUPERIOR')].groupby(['cpf_region','gender_namsor_adjusted'])['cpf_region'].count().unstack('gender_namsor_adjusted').fillna(0)
#result
dfUfrnComplete[(dfUfrnComplete['emp_responsability']=='PROFESSOR DO MAGISTERIO SUPERIOR')]

In [78]:
dfUfrnComplete[(dfUfrnComplete['gender_namsor'] =='')]['first_name'].value_counts(sort=True)


Out[78]:
EUSAMAR      1
MARGARIDA    1
PAULO        1
JULIANA      1
SAVIO        1
SEBASTIAN    1
SANDRO       1
ANDERSON     1
UMBERTO      1
MURILLO      1
ADALBERTO    1
DANIEL       1
MARCUS       1
TATIANE      1
Name: first_name, dtype: int64

We will adjust the names that are unkown in the NAMSOR servise manually. For this, we will all first name finished in 'o', 'os', 'on', 'u', 'us' as male names. The other names, like finished in 'e', 'i', and etc do not follow a constant like male names do with o and u. However, there is a list of know names finished in 'o' that are female names and are:

  • Aiko
  • Amparo
  • Anunciação
  • Assunção
  • Calipso
  • Carmo
  • Carminho
  • Cléo
  • Charo
  • Clio
  • Conceição
  • Consolação
  • Consuelo
  • Dido
  • Erato
  • Índigo
  • Ino
  • Io
  • Izaro
  • Juno
  • Keiko
  • Lerato
  • Leto
  • Lilo
  • Lucero
  • Margô
  • Mirto
  • Purificação
  • Rocío
  • Rosário
  • Rosarinho
  • Socorro
  • Tamiko
  • Tariro
  • Temisto
  • Yoko
so for this we could seach for the first name:


In [111]:
dfUfrnComplete[dfUfrnComplete['first_name'].str.contains("AIKO | AMPARO | ANUNCIAÇÃO | ASSUNÇÃO | CALIPSO | CARMO  | CARMINHO | CLÉO | CHARO | CLIO | CONCEIÇÃO | CONSOLAÇÃO | CONSUELO | DIDO | ERATO | ÍNDIGO | INO | IO | IZARO | JUNO | KEIKO | LERATO | LETO | LILO | LUCERO | MARGÔ | MIRTO | PURIFICAÇÃO | ROCÍO | ROSÁRIO | ROSARINHO | SOCORRO | TAMIKO | TARIRO | TEMISTO | YOKO")==True]


Out[111]:
cpf hrefLevel2 name emp_enrollment emp_responsability emp_responsability_class emp_responsability_pattern emp_responsability_reference emp_responsability_level emp_UF ... emp_paycheck_after_deduction emp_other_receivings e1 e2 e3 first_name last_name gender_pygenderbr gender_namsor cpf_region

0 rows × 46 columns

as we can see there is no name of the given list se now we can generalize and say that the names that fall in the rule above described are male names


In [112]:
dfUfrnComplete['gender_namsor_adjusted'] = dfUfrnComplete['gender_namsor']

In [155]:
#dfUfrnComplete.loc[(dfUfrnComplete['first_name'].str[-1] == 'O') & (dfUfrnComplete['gender_namsor'] =='unknown'),'gender_namsor_adjusted']#= 'male'
#dfUfrnComplete.loc[(dfUfrnComplete['first_name'].str[-2:] == 'ON') & (dfUfrnComplete['gender_namsor'] =='unknown'),'gender_namsor_adjusted'] #= 'male'
#dfUfrnComplete.loc[(dfUfrnComplete['first_name'].str[-1] == 'U') & (dfUfrnComplete['gender_namsor'] =='unknown'),'gender_namsor_adjusted'] #= 'male'
#dfUfrnComplete.loc[(dfUfrnComplete['first_name'].str[-2:] == 'US') & (dfUfrnComplete['gender_namsor'] =='unknown'),'gender_namsor_adjusted'] #= 'male'
#dfUfrnComplete.loc[(dfUfrnComplete['first_name'].str[-2:] == 'OS') & (dfUfrnComplete['gender_namsor'] =='unknown'),'gender_namsor_adjusted'] = 'male'

In [159]:
#dfUfrnComplete[(dfUfrnComplete['first_name'].str[-1] == 'O') & (dfUfrnComplete['gender_namsor'] =='unknown')]
#dfUfrnComplete[(dfUfrnComplete['first_name'].str[-2:] == 'ON') & (dfUfrnComplete['gender_namsor'] =='unknown')]
#dfUfrnComplete[(dfUfrnComplete['first_name'].str[-1] == 'U') & (dfUfrnComplete['gender_namsor'] =='unknown')]
#dfUfrnComplete[(dfUfrnComplete['first_name'].str[-2:] == 'US') & (dfUfrnComplete['gender_namsor'] =='unknown')]
#dfUfrnComplete[(dfUfrnComplete['first_name'].str[-2:] == 'OS') & (dfUfrnComplete['gender_namsor'] =='unknown')]
dfUfrnComplete[(dfUfrnComplete['gender_namsor_adjusted'] =='unknown')]['first_name']


Out[159]:
8          ACYNELLY
50          ADLAREG
127        AISLANIA
146         ALBIMAR
147         ALBIMAR
152         ALCIONE
265         ALMARIA
286          ALZENI
290        AMALUSIA
468       ANDERSONN
551          ANEIDE
554        ANELLYSA
610      ANTONIELLI
743        AURIGENA
747       AUZELIVIA
754          AZEMAR
766         BARNORA
767         BARTIRA
794          BONIEK
1027          CHENG
1087     CLAUDIANNY
1143       CRISLUCI
1144    CRISTHIANNE
1324     DERISCLEIA
1329       DEUSIMAR
1381        DJAILDE
1390      DOMINIQUE
1426         EDILZA
1509         EDZANA
1533         ELEDIR
           ...     
5523         SANZIA
5524        SAONARA
5568         SEMELY
5628       SHEYLENA
5673        SISLLEY
5680        SONAYDY
5689      SORANEIDE
5702        SUELENE
5709          SUENI
5710         SUENIA
5713         SULEMI
5724       SUZERICA
5740          TAIZA
5820         THAISE
5823         THAIZA
5832       THATYANE
5924        VALDECY
5927       VALDENIA
6061      WADLEIGHN
6069        WALANNE
6072      WALDENICE
6079        WALLACY
6090     WANDERLEIA
6094        WANUSIA
6100         WEDNEY
6118       WICLIFFE
6120         WILACI
6154       YULYANNA
6166     ZANKENNEDY
6178        ZORAIDE
Name: first_name, dtype: object

In [168]:
#dfUfrnComplete.loc[dfUfrnComplete['first_name'].str.contains("ACYNELLY | AISLANIA | ALCIONE | ALMARIA | AMALUSIA | ANEIDE | ANELLYSA | AURIGENA | AUZELIVIA | BARNORA | CLAUDIANNY | CRISLUCI | CRISTHIANNE | DERISCLEIA | EDILZA | EDZANA | SANZIA | SAONARA | SEMELY | SHEYLENA | SISLLEY | SONAYDY | SORANEIDE | SUELENE | SUENI | SUENIA | SULEMI | SUZERICA | TAIZA | THAISE | THAIZA | THATYANE | VALDECY | VALDENIA | WALANNE | WALDENICE | WANDERLEIA | WANUSIA | WICLIFFE | YULYANNA | ZORAIDE ")==True]['gender_namsor_adjusted'] = 'female'
dfUfrnComplete.loc[dfUfrnComplete['first_name'].str.contains("ACYNELLY|AISLANIA|ALCIONE|ALMARIA|AMALUSIA|ANEIDE|ANELLYSA|AURIGENA|AUZELIVIA|BARNORA|CLAUDIANNY|CRISLUCI|CRISTHIANNE|DERISCLEIA|EDILZA|EDZANA|SANZIA|SAONARA|SEMELY|SHEYLENA|SISLLEY|SONAYDY|SORANEIDE|SUELENE|SUENI|SUENIA|SULEMI|SUZERICA|TAIZA|THAISE|THAIZA|THATYANE|VALDECY|VALDENIA|WALANNE|WALDENICE|WANDERLEIA|WANUSIA|WICLIFFE|YULYANNA|ZORAIDE")]['gender_namsor_adjusted'] = 'female'


/Users/marco/anaconda/lib/python3.6/site-packages/ipykernel/__main__.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app

In [161]:
dfUfrnComplete[(dfUfrnComplete['gender_namsor_adjusted'] =='unknown')]['first_name']


Out[161]:
8          ACYNELLY
50          ADLAREG
127        AISLANIA
146         ALBIMAR
147         ALBIMAR
152         ALCIONE
265         ALMARIA
286          ALZENI
290        AMALUSIA
468       ANDERSONN
551          ANEIDE
554        ANELLYSA
610      ANTONIELLI
743        AURIGENA
747       AUZELIVIA
754          AZEMAR
766         BARNORA
767         BARTIRA
794          BONIEK
1027          CHENG
1087     CLAUDIANNY
1143       CRISLUCI
1144    CRISTHIANNE
1324     DERISCLEIA
1329       DEUSIMAR
1381        DJAILDE
1390      DOMINIQUE
1426         EDILZA
1509         EDZANA
1533         ELEDIR
           ...     
5523         SANZIA
5524        SAONARA
5568         SEMELY
5628       SHEYLENA
5673        SISLLEY
5680        SONAYDY
5689      SORANEIDE
5702        SUELENE
5709          SUENI
5710         SUENIA
5713         SULEMI
5724       SUZERICA
5740          TAIZA
5820         THAISE
5823         THAIZA
5832       THATYANE
5924        VALDECY
5927       VALDENIA
6061      WADLEIGHN
6069        WALANNE
6072      WALDENICE
6079        WALLACY
6090     WANDERLEIA
6094        WANUSIA
6100         WEDNEY
6118       WICLIFFE
6120         WILACI
6154       YULYANNA
6166     ZANKENNEDY
6178        ZORAIDE
Name: first_name, dtype: object

Position of trust x Gender



In [264]:
dfUfrnComplete.columns


Out[264]:
Index(['cpf', 'hrefLevel2', 'name', 'emp_enrollment', 'emp_responsability',
       'emp_responsability_class', 'emp_responsability_pattern',
       'emp_responsability_reference', 'emp_responsability_level', 'emp_UF',
       'emp_UORG', 'emp_legal_regime', 'emp_activit', 'emp_absence_from_work',
       'emp_work_time', 'emp_last_job_responsability_modification_date',
       'emp_job_responsability_nominee_date',
       'emp_job_responsability_nominee_act',
       'emp_last_job_responsability_modification_date_body',
       'emp_entrance_public_service_legal_document',
       'emp_entrance_public_service_legal_document_number',
       'emp_entrance_public_service_legal_document_date',
       'responsability_description', 'responsability_activity',
       'responsability_activity_parcial', 'responsability_UF',
       'responsability_UORG', 'responsability_sup_body',
       'responsability_last_data_change_resp', 'url_level_three', 'emp_month',
       'emp_year', 'emp_tot_paycheck', 'emp_irrf', 'emp_rgps',
       'emp_other_deduction', 'emp_paycheck_after_deduction',
       'emp_other_receivings', 'e1', 'e2', 'e3', 'first_name', 'last_name',
       'gender_pygenderbr', 'gender_namsor', 'cpf_region',
       'gender_namsor_adjusted', 'gender_genderguesser'],
      dtype='object')

In [279]:
result = dfUfrnComplete[(dfUfrnComplete['responsability_description'].str.len() >3)&((dfUfrnComplete['gender_namsor_adjusted'] == 'male')|(dfUfrnComplete['gender_namsor_adjusted'] == 'female'))].groupby(['responsability_description','gender_namsor_adjusted'])['responsability_description'].count().unstack('gender_namsor_adjusted').fillna(0)
result.plot.bar(figsize=(17,10))


Out[279]:
<matplotlib.axes._subplots.AxesSubplot at 0x11b8c7e80>

In [280]:
print(result)


gender_namsor_adjusted                              female   male
responsability_description                                       
CD  000.1 - CARGO DE  DIRECAO - CD - IFES              1.0    0.0
CD  000.2 - CARGO DE  DIRECAO - CD - IFES              3.0    5.0
CD  000.3 - CARGO DE  DIRECAO - CD - IFES             18.0   20.0
CD  000.4 - CARGO DE  DIRECAO - CD - IFES             18.0   34.0
DAS 101.4 - DIRECAO E ASSESSORAMENTO SUPERIOR          0.0    1.0
DAS 102.4 - DIRECAO E ASSESSORAMENTO SUPERIOR          0.0    1.0
FG  000.1 - FUNCAO GRATIFICADA - IFES                 43.0   66.0
FG  000.2 - FUNCAO GRATIFICADA - IFES                 33.0   55.0
FG  000.3 - FUNCAO GRATIFICADA - IFES                 17.0    7.0
FG  000.4 - FUNCAO GRATIFICADA - IFES                 23.0   27.0
FG  000.5 - FUNCAO GRATIFICADA - IFES                  5.0    9.0
FG  000.6 - FUNCAO GRATIFICADA - IFES                  4.0    6.0
FG  000.7 - FUNCAO GRATIFICADA - IFES                 24.0   31.0
FPE 101.2 - FUNCAO COMISSIONADA DO PODER EXECUTIVO     0.0    1.0
FUC 000.1 - FUNCAO COMISSIONADA DE COORD. CURSO       90.0  119.0
GF  002.1 - FUNCAO GRATIFICADA                         0.0    1.0
GF  002.2 - FUNCAO GRATIFICADA                         1.0    0.0
GF  002.5 - FUNCAO GRATIFICADA                         0.0    2.0
GF  002.7 - FUNCAO GRATIFICADA                         1.0    5.0

In [302]:
result['f_percentage'] = result['female']+result['male']
result['m_percentage'] = result['female']+result['male']
result['f_percentage'] = result['female']/result['f_percentage']
result['m_percentage'] = result['male']/result['m_percentage']
result['f_percentage'] = result['f_percentage']*100
result['m_percentage'] = result['m_percentage']*100

In [305]:
result[['f_percentage','m_percentage']].plot.bar(figsize=(17,10))


Out[305]:
<matplotlib.axes._subplots.AxesSubplot at 0x12b9e2d68>

In [308]:
print(result[['f_percentage', 'm_percentage']].mean())
result[['f_percentage', 'm_percentage']].mean().plot.bar(color=sns.color_palette(),figsize=(17,10))


gender_namsor_adjusted
f_percentage    36.439274
m_percentage    63.560726
dtype: float64
Out[308]:
<matplotlib.axes._subplots.AxesSubplot at 0x12ba88240>

In [ ]:
dfUfrnComplete.emp_tot_paycheck = dfUfrnComplete.emp_tot_paycheck.astype(dtype="float")
dfUfrnComplete.emp_tot_paycheck.describe()

In [505]:
dfUfrnComplete.emp_tot_paycheck.copy().sort_values(ascending=False)


Out[505]:
1710    46284.99
2764    44634.77
4468    43196.19
4605    42474.74
2201    42045.57
3881    41844.59
4114    41645.74
13      41395.15
3792    40862.62
2766    40573.57
4232    40162.95
786     39996.44
723     39228.01
3698    39073.82
4073    39037.94
4245    38484.95
2936    38098.47
1548    37771.61
5138    37603.55
621     37358.16
5375    36870.87
4133    36697.80
2029    36178.54
4144    35983.63
277     34440.39
3050    33796.19
1649    33016.24
3761    33001.21
3762    33001.21
5804    32610.44
          ...   
210         0.00
199         0.00
183         0.00
1270        0.00
1291        0.00
3282        0.00
2387        0.00
3218        0.00
2974        0.00
2932        0.00
130         0.00
2587        0.00
2570        0.00
2499        0.00
2466        0.00
2450        0.00
2449        0.00
2320        0.00
1664        0.00
2319        0.00
2296        0.00
2244        0.00
2184        0.00
2096        0.00
1877        0.00
1823        0.00
1817        0.00
1788        0.00
1671        0.00
0           0.00
Name: emp_tot_paycheck, dtype: float64

In [506]:
dfUfrnComplete[(dfUfrnComplete['emp_responsability']=='PROFESSOR DO MAGISTERIO SUPERIOR')].emp_tot_paycheck.copy().sort_values(ascending=False)


Out[506]:
1710    46284.99
2764    44634.77
4468    43196.19
4605    42474.74
3881    41844.59
4114    41645.74
13      41395.15
4232    40162.95
786     39996.44
723     39228.01
3698    39073.82
4073    39037.94
4245    38484.95
2936    38098.47
5138    37603.55
621     37358.16
277     34440.39
3050    33796.19
1649    33016.24
3761    33001.21
5804    32610.44
4261    31868.58
4284    31781.43
4811    31651.75
2931    31579.84
4658    31508.64
4312    31147.91
1769    30913.04
1119    30854.76
4012    30689.75
          ...   
983         0.00
6087        0.00
598         0.00
1019        0.00
1135        0.00
6036        0.00
6104        0.00
1664        0.00
4756        0.00
5601        0.00
496         0.00
493         0.00
4234        0.00
4078        0.00
4076        0.00
4014        0.00
3941        0.00
3908        0.00
3767        0.00
3691        0.00
3647        0.00
3282        0.00
2974        0.00
2932        0.00
2570        0.00
2319        0.00
2296        0.00
1877        0.00
490         0.00
0           0.00
Name: emp_tot_paycheck, dtype: float64

Final marks


In Brazil a rule is to make a contest to select the best possible employee available or in some cases there is the figure of 'Cargo em Comissão' is a position of trust and the elected one could choose whoever it wants. In the Federal Government there are no distinctions about gender. Every employee follow a path to carrer growth and technically if a person follow a determined path there is not possibility to salary distinctions based on gender. So we did not considered evaluate the amount of salary based on gender. Therefore, we considered that in the public service exists positions of trust that receive a plus in the salary. As we could detect in the last section there is a difference that is almost 30% or to be precise:

There is an average of 27.12% more man in positions of trust than woman!

Another interesting fact is that at least 1068 people were born in another states it represents 17,28% of the total of UFRN's employees. At least beacuse the fiscal region is composed by RN and another trhree states.

At least 17.28% UFRN's employees were born in another state besides Rio Grande do Norte!

We also detected that are a good number of employees that passes the limit of how much a government employee is allowed by law to receive. From the TOP30 paychecks only 9 were from administratives jobs all the others are Professors.

In general the proportion of gender is 6180

Tool Male(%) Female(%) Unknown(%)
Namsor 50.66 46.18 3.15
(Py)Gender BR 49.46 45.58 4.95
Gender Guesser 44.62 34.23 21.13

As we can see, the guess the gender of a name is highly dependable from which country is associated. The gender guesser, a tool that we did not specified the origin of the names could not be effective for this scenario.